Abstract

In this paper, we describe a method of execution retry for bypassing
software faults based on checkpointing, rollback, message reordering
and replaying. We demonstrate how rollback techniques, previously
developed for transient hardware failure recovery, can also be used to
recover from software errors by exploiting message reordering to bypass
software faults. Our approach intentionally increases the degree of
nondeterminism and the scope of rollback when a previous retry fails.
Examples from our experience with telecommunications software systems
illustrate the benefits of the scheme.


1  Introduction 

Numerous checkpointing and rollback recovery techniques have been
proposed in the literature to recover from transient hardware failures.
Uncoordinated checkpointing schemes [1,2] allow maximum process autonomy
and general nondeterministic execution, but suffer from potential
domino effects [3]. Coordinated checkpointing schemes [4,5] eliminate
the domino effect by sacrificing a certain degree of process autonomy
and by paying the cost of extra coordination messages. Recently, a lazy
checkpoint coordination technique [6] has been proposed as a mechanism
for bounding rollback propagation and providing a flexible trade-off
between run-time coordination overhead and recovery efficiency.

Log-based recovery provides another way of achieving domino-free
recovery. Under the piecewise deterministic model [7], the domino effect
is avoided through message logging and deterministic replaying. In a
pessimistic logging protocol [8,9], each message is logged upon receipt,
which prevents the rollback of a faulty process from causing the
rollback of any other process. Optimistic logging protocols [10-12] have
been proposed to reduce run-time overhead by using asynchronous message
logging at the expense of possible rollback propagation due to lost
volatile message logs upon failure.

Instead of proposing another checkpointing and recovery protocol, this
paper investigates the possibility of applying the log-based techniques
to recovery from software errors [10,13-16]. We previously proposed
message reordering for changing the communication pattern at run time in
order to reduce the rollback distance for hardware failures [17]. In
this paper, we demonstrate how message reordering can also provide an
effective way of bypassing certain software faults. Fig. 1 illustrates
the basic concept. When a software error is detected at the point marked
“X”, rollback and message replaying based on the complete checkpoint and
message log information may lead to the same error. By intentionally
discarding part of the message logs, we can deterministically
reconstruct the system state up to the dotted line shown in Fig. 1, and
then use message reordering to introduce nondeterministic execution
beyond the dotted line in order to bypass the software fault. Unlike the
recovery block approach [3] and N-version programming [18], which both
use different programs to execute on the same set of data, the above
on-line retry approach [14,19] uses the same program to operate on a
different but consistent set of data [20] obtained through message
reordering.

Based on our experience with telecommunications software systems, the
technique of execution retry with rollback and message replaying has
demonstrated its usefulness for bypassing the so-called software
boundary errors. Usually, an application contains a main routine that
performs the designated functions, and some boundary code for handling
specific situations, collectively referred to as boundary conditions,
such as program exceptions, resource failures, urgent or unexpected
messages, failures on system or function calls, etc. The boundary code
is often not well tested due to the difficulty in creating such boundary
conditions in a test environment [15]. It has been shown that the
probability of software errors in the boundary code, called software
boundary errors, can be significantly higher than that in the main
routine [15]. These kinds of software boundary errors may cause a
catastrophic event such as the AT&T 4ESS switching system failure in
January 1990 [21]. The fact that a software boundary condition usually
occurs very rarely also suggests that if a boundary error does occur,
then on-line retry by replaying and/or reordering the incoming messages
may be helpful in bypassing the boundary condition.

Figure 1: Nondeterministic execution through message reordering.

The simplest approach to execution retry is to roll back the entire
system and restart from a consistent global checkpoint. This can result
in nondeterministic execution in a distributed message-passing
environment, and this nondeterminism may result in bypassing the
boundary condition. However, it is often desirable to limit the scope of
rollback, that is, the number of involved processes as well as the total
rollback distance, in order to achieve faster recovery [22]. It is
possible that a small-scope rollback involving only a few processes
suffices for successful retry. This motivates the progressive retry
concept, which progressively increases the scope of rollback to
intentionally introduce more nondeterminism when a previous retry fails.
Such an idea has been implemented in a telecommunication billing system
and has been shown to improve the availability of the system. The
objective of this paper is to describe and formalize the concept of
progressive retry with message reordering to bypass software errors and
to present a framework for implementation. The technique is being built
into an existing fault tolerance library [23] in order to facilitate
future software development.


2 Logical Checkpoints and Recovery Lines


Let N be the number of processes in the system. Suppose p1 in Fig. 2
initiates a rollback at the point marked “X”. In a general
nondeterministic execution, the rollback of p1 to its checkpoint C will
unsend messages M1 and M3, and thus require p0 to roll back to a state
before the receipt of M1 in order to unreceive M1 and similarly require
p2 to unreceive M3; otherwise, M1 and M3 are recorded as “received but
not yet sent”, which results in an inconsistency of system state.


Figure 2: State consistency (a) example checkpoint and communication
pattern, (b) logical checkpoint dependency and recovery line.


However, if p1 can reconstruct the state from which M1 was generated2,
the execution of p0 based on the processing of M1 is still valid and
therefore p0 need not roll back. This can be achieved by the piecewise
deterministic model and additional message logging and replaying. The
piecewise deterministic model says: process execution between two
consecutive message receipts, called a state interval, is deterministic.
So if p1 has logged both the message content and the state interval
index [11] (i.e., the processing order) for message M0 (but not for M2)
by the time it initiates the rollback, p1 can deterministically
reconstruct the state up to immediately before the receipt of M2 (a
nondeterministic event) and therefore M1 remains a valid message.
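
The state reconstruction argument above can be sketched in a few lines.
This is only an illustration under stated assumptions: the handler
`handle`, the record type `LoggedMessage` and the integer process state
are hypothetical and do not come from the paper; the only property
relied upon is that a state interval is deterministic.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class LoggedMessage:
    name: str
    content: int
    interval_index: Optional[int]  # None: processing order was not logged


def handle(state: int, content: int) -> int:
    """Hypothetical deterministic handler: the next state depends only on
    the current state and the message content (one state interval)."""
    return state * 31 + content


def reconstruct(checkpoint_state: int, checkpoint_index: int,
                log: List[LoggedMessage]) -> Tuple[int, int]:
    """Replay logged messages in their original order from the physical
    checkpoint and stop at the first state interval whose index is not in
    the log. Returns (state, index of the logical checkpoint reached)."""
    by_index = {m.interval_index: m for m in log
                if m.interval_index is not None}
    state, index = checkpoint_state, checkpoint_index
    while index + 1 in by_index:
        index += 1
        state = handle(state, by_index[index].content)
    return state, index


# M0 is fully logged; for M2 the processing order is unknown.
log = [LoggedMessage("M0", 7, 1), LoggedMessage("M2", 9, None)]
state, logical_index = reconstruct(checkpoint_state=0, checkpoint_index=0,
                                   log=log)
```

Replay stops just before the receipt of M2, so every message sent during
the interval started by M0 remains valid.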

2 Here we assume fail-stop [24] failures only for the purpose of
introducing logical checkpoints. Step 4 and Step 5 in our progressive
retry technique described in the next section can in fact relax such an
assumption.

A useful way to unify these two seemingly different state consistency
concepts is to introduce the notion of a logical checkpoint. While a
physical checkpoint like checkpoint C allows the restoration of process
state at the point the checkpoint was taken, checkpoint C and the
message log of M0, plus the underlying piecewise deterministic model,
effectively place a logical checkpoint at the end of the state interval
started by M0 (as shown in Fig. 2(a)) because of the capability of state
reconstruction. In other words, although p1 “physically” rolls back to
checkpoint C, it “logically” rolls back to the above logical checkpoint
and therefore does not unsend M1. It then becomes clear that while
“physical rollback distance” determines the rollback extent of each
individual process, “logical rollback distance” controls the extent of
rollback propagation and therefore the number of processes involved in
the recovery. For simplicity, we let each physical checkpoint initiate a
new state interval and represent it by a logical checkpoint at the end
of that interval. Based on the above notion of logical checkpoints, the
following rollback propagation rule3 is then valid with or without the
piecewise deterministic model:

if  the  sender  (logically)  rolls  back  and  unsends 
a  message  M,  the  receiver  must  also  (logically) 
roll  back  to  unreceive  M. 

We define a global checkpoint as a set of N (logical) checkpoints, one
from each process. A consistent global checkpoint is a global checkpoint
that does not contain any two checkpoints violating the above rollback
propagation rule. The recovery line is the latest available consistent
global checkpoint, which uniquely minimizes the total rollback distance
[25]. As an illustration, suppose all the messages in Fig. 2(a) except
for M2 are logged when p1 initiates the rollback. Fig. 2(b) shows the
dependency graph for the available logical checkpoints. By starting with
the set of the last logical checkpoints of each process and applying the
rollback propagation rule described above, we can determine the recovery
line to be the set of shaded checkpoints in Fig. 2(b). Notice that p0
may be required to physically roll back in order to regenerate the lost
message M2.
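
The recovery line computation described above can be sketched as a
fixed-point iteration. The code below is a minimal illustration, not
the algorithm of [25]: the `Message` record, the integer numbering of
logical checkpoints (checkpoint k closes state interval k, and
checkpoint 0 is assumed always available) and the example pattern are
assumptions made here, and the example is not a transcription of Fig. 2.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Message:
    name: str
    sender: str
    send_interval: int  # sender's state interval in which it was sent
    receiver: str
    recv_interval: int  # receiver's state interval started by its receipt


def recovery_line(available: Dict[str, List[int]],
                  messages: List[Message]) -> Dict[str, int]:
    """Start from the last available logical checkpoint of each process
    and apply the rollback propagation rule until nothing changes."""
    line = {p: max(cps) for p, cps in available.items()}
    changed = True
    while changed:
        changed = False
        for m in messages:
            unsent = line[m.sender] < m.send_interval
            still_received = line[m.receiver] >= m.recv_interval
            if unsent and still_received:
                # The receiver must logically roll back to unreceive m.
                line[m.receiver] = max(c for c in available[m.receiver]
                                       if c < m.recv_interval)
                changed = True
    return line


# Hypothetical pattern: p1 lost its logical checkpoint 2 because the
# message that started interval 2 was not logged.
available = {"p0": [0, 1, 2], "p1": [0, 1], "p2": [0, 1]}
messages = [
    Message("M1", "p1", 1, "p0", 1),
    Message("M2", "p0", 2, "p1", 2),
    Message("M3", "p1", 2, "p2", 1),
]
line = recovery_line(available, messages)
```

Each checkpoint index can only decrease and is bounded below by 0, so
the loop terminates. In the example only p2 is forced back, because M3
is the only message that p1 unsends.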

3 Progressive Retry for Bypassing Software Errors

We base our discussion on the following system model and recovery
protocol.

FIFO channel: messages sent along the same channel between any two
processes are ordered by monotonically increasing sequence numbers.

Merge component: messages from all incoming channels are merged by the
merge component [10] based on a changeable merge function, and are
assigned the state interval indices.

Logging before processing: every message is logged before delivery to
the application process4.

Direct dependency tracking [11,25,26]: only the dependency of the
receiver’s logical checkpoint on the sender’s logical checkpoint
resulting from each message processing is recorded, as opposed to the
transitive dependency tracking which has been used in many log-based
papers [10,12].

Centralized recovery line computation: the global dependency information
is collected by the process which initiates the garbage collection or
recovery procedure [2,11] and is responsible for the recovery line
computation5.

3 In contrast, when the receiver (logically) rolls back and unreceives a
message M', the sender does not have to (logically) roll back if M' is
logged at the sender or the receiver side and can be retrieved during
reexecution, or if M' can be regenerated by the sender.

4 The results can be extended to systems with asynchronous (optimistic)
message logging by making additional logical checkpoints unavailable for
those volatile message logs lost due to the failure.
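
As a rough illustration of the merge component and the
logging-before-processing assumption, the sketch below merges per-sender
FIFO queues with a replaceable merge function, assigns state interval
indices and appends each message to a log before handing it over. The
class `MergeComponent`, the two merge functions and the queue layout are
hypothetical names chosen here; the merge component of [10] is not
specified at this level of detail in the text.

```python
from itertools import zip_longest
from typing import Callable, Dict, Iterable, List, Tuple

Channels = Dict[str, List[Tuple[int, str]]]  # sender -> [(seq, content)]
Item = Tuple[str, int, str]                  # (sender, seq, content)


def grouped(channels: Channels) -> Iterable[Item]:
    """Deliver all messages of one sender before those of the next."""
    for sender in sorted(channels):
        for seq, content in channels[sender]:
            yield sender, seq, content


def round_robin(channels: Channels) -> Iterable[Item]:
    """Interleave the channels, one message per sender per round."""
    queues = [[(s, seq, c) for seq, c in channels[s]]
              for s in sorted(channels)]
    for batch in zip_longest(*queues):
        for item in batch:
            if item is not None:
                yield item


class MergeComponent:
    def __init__(self, merge_fn: Callable[[Channels], Iterable[Item]]):
        self.merge_fn = merge_fn  # changeable merge function
        self.next_index = 1       # next state interval index
        self.log: List[Tuple[int, str, int, str]] = []

    def deliver(self, channels: Channels) -> List[Tuple[int, str, int, str]]:
        delivered = []
        for sender, seq, content in self.merge_fn(channels):
            entry = (self.next_index, sender, seq, content)
            self.log.append(entry)   # log before delivery
            delivered.append(entry)  # then hand to the application
            self.next_index += 1
        return delivered


channels = {"CA": [(1, "r2"), (2, "r3")], "CD": [(1, "f2")]}
in_groups = MergeComponent(grouped).deliver(channels)
interleaved = MergeComponent(round_robin).deliver(channels)
```

Both merge functions keep the per-channel sequence order, so swapping
one for the other changes only the interleaving across channels.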

3.1  Recovery  Line  and  Message  Logs 

With respect to the recovery line consisting of the shaded checkpoints
shown in Fig. 3, messages can be classified into four categories.

1. Obsolete messages: In order to reconstruct the state up to the
recovery line, the system can restart from the set of restarting
checkpoints, called the restart line, as illustrated in Fig. 3. Messages
that were processed before the restart line, for example Mo, are
therefore obsolete messages and not useful for recovery.

2. Messages for deterministic replay (deterministic messages): Messages
processed between the restart line and the recovery line must have both
their message contents and state interval indices logged. These messages
need to be replayed in their original order for deterministic state
reconstruction. MD and M'D are such messages.

3. In-transit (or channel-state) messages: For messages sent before the
recovery line and processed after, only the message contents in the log
are valid. The state interval indices are either not logged or
invalidated. Messages like MI and M'I belong to this category and can be
processed in arbitrary order.

4. Orphan messages: Messages sent after the recovery line are orphan
messages. M'R cannot exist because otherwise the recovery line would not
be consistent. MR is invalidated by the rollback and should be
discarded.
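
The four categories above can be expressed as a small decision
procedure. This is a sketch under assumptions introduced here: the
restart line and recovery line are given as one logical checkpoint index
per process, checkpoint k closes state interval k, and the `Message`
fields are hypothetical names.

```python
from enum import Enum
from typing import Dict, NamedTuple


class Category(Enum):
    OBSOLETE = 1
    DETERMINISTIC = 2
    IN_TRANSIT = 3
    ORPHAN = 4


class Message(NamedTuple):
    sender: str
    send_interval: int  # sender's state interval in which it was sent
    receiver: str
    recv_interval: int  # receiver's state interval started by its receipt


def classify(m: Message, restart: Dict[str, int],
             recovery: Dict[str, int]) -> Category:
    sent_after = m.send_interval > recovery[m.sender]
    processed_after = m.recv_interval > recovery[m.receiver]
    if sent_after:
        if not processed_after:
            # Sent after but processed before the recovery line.
            raise ValueError("recovery line is not consistent")
        return Category.ORPHAN
    if processed_after:
        return Category.IN_TRANSIT
    if m.recv_interval > restart[m.receiver]:
        return Category.DETERMINISTIC
    return Category.OBSOLETE


restart = {"a": 1, "b": 1}
recovery = {"a": 2, "b": 2}
categories = [classify(m, restart, recovery) for m in (
    Message("a", 1, "b", 1),
    Message("a", 1, "b", 2),
    Message("a", 2, "b", 3),
    Message("a", 3, "b", 4),
)]
```

Because the enumeration values follow the category numbers, a message
that moves to a later category when the recovery line moves backward
shows up as an increase in `Category.value`.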

Figure 3: Example of obsolete, deterministic, in-transit and orphan
messages.

In an optimistic logging protocol [10], rollback propagation can result
from the nondeterminism due to lost volatile message logs upon failure.
Based on the available message logs from stable storage, the recovery
line is uniquely determined and each message must statically belong to
one of the four categories depending on its position relative to the
recovery line. In contrast, our retry technique progressively increases
the degree of nondeterminism and the scope of rollback by discarding
more message logs each time a previous retry fails. At each step, a new
recovery line or restart line is computed based on the remaining
checkpoint and message log information. Since the recovery line moves
backward in time during the progressive retry, messages belonging to the
ith category can shift to the jth category, where 4 ≥ j > i ≥ 1, at a
later stage.

5 A distributed and synchronized algorithm has been proposed by Sistla
and Welch [12]. A distributed and asynchronous algorithm can be found in
Strom and Yemini’s paper [10].


4  Progressive  Retry 

We  will  use  the  example  checkpoint  and  communication 
pattern  shown  in  Fig.  4  to  illustrate  progressive  retry  in  five 
steps. 

Step 1 - Receiver deterministic retry: When p2 detects an error, it
first initiates a local recovery by rolling back to checkpoint C and
deterministically replaying the message logs. Because every message is
logged before processing, message logs for Ma, Mb and Mc must be
available and allow p2 to reconstruct the state up to the point it
detected the error, as illustrated by the recovery line shown in
Fig. 4(a). In some cases, transient failures may be caused by some
environmental factors which will simply disappear after the recovery,
and the Step-1 retry may succeed. If the reexecution still leads to the
same error, the checkpoint and message log information is copied to a
trace file for off-line debugging and Step 2 is initiated.

Step 2 - Receiver nondeterministic retry: p2 starts introducing
nondeterminism by discarding the state interval indices of Ma, Mb and Mc
in order to allow message reordering. As a result, the last three
logical checkpoints of p2 are now unavailable and the resulting recovery
line is shown in Fig. 4(b). Notice that only Ma and Mb are in-transit
messages available for reordering; message Mc as well as M, now become
orphan messages and should be discarded.


Message reordering can be achieved by random reordering or by restoring
the in-transit messages to the input of the merge component and
re-assigning them with possibly different state interval indices. An
alternative is to group the messages from the same process together if
the software bug is possibly due to the interleaving of messages from
different processes. If only two messages are involved, forcing them
into the opposite order may be useful.
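
The reordering options just listed can be sketched as three small
functions over the in-transit messages. The `(sender, sequence number)`
representation and the function names are assumptions made for this
illustration and are not taken from the paper or from any library.

```python
import random
from typing import List, Optional, Tuple

Msg = Tuple[str, int]  # (sender, sequence number)


def random_reorder(msgs: List[Msg], seed: Optional[int] = None) -> List[Msg]:
    """Random reordering of the in-transit messages."""
    shuffled = list(msgs)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def group_by_sender(msgs: List[Msg]) -> List[Msg]:
    """Remove interleaving between senders. The sort is stable, so the
    relative order of messages from the same sender is kept."""
    return sorted(msgs, key=lambda m: m[0])


def opposite_order(msgs: List[Msg]) -> List[Msg]:
    """With exactly two messages, force them into the opposite order."""
    if len(msgs) != 2:
        raise ValueError("opposite_order expects exactly two messages")
    return [msgs[1], msgs[0]]


in_transit = [("p1", 1), ("p3", 1), ("p1", 2)]
grouped_order = group_by_sender(in_transit)
swapped = opposite_order([("p1", 1), ("p3", 1)])
shuffled = random_reorder(in_transit, seed=0)
```

Passing a seed makes a random retry repeatable, which can help when the
discarded orderings are later examined off-line.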

Step  3  -  Sender  deterministic  retry:  The  main  purpose 
of  this  step  is  to  include  more  “future”  messages  for 
reordering  in  order  to  increase  the  effectiveness.  It  is 
useful,  for  example,  when  a  software  fault  is  triggered 
by  some  unexpected  delay  in  the  delivery  of  certain 
messages. 

Messages  that  have  arrived  at  the  receiver  but  not  yet 
been  logged  can  be  lost  upon  failure.  Message  Md 
in  Fig.  4  is  an  example.  Such  lost  messages  can  be 
detected6  when  the  receiver  receives  another  message 
from  the  same  sender  which  indicates  a  discontinuity 
in  the  message  sequence  number  [10].  The  sender  is 
then  requested  to  resend  the  message  if  sender  logging 
is  available  [10],  or  to  regenerate  the  message  through 
deterministic  state  reconstruction  [12]. 

The immediate recovery of such lost messages is useful for increasing
the number of messages available for reordering. p2 now discards the
message contents of Ma and Mb as well. Although the resulting recovery
line as shown in Fig. 4(c) is the same as the one in (b), p3 in addition
to p1 and p2 is rolled back7 in order to regenerate (recover) the lost
message Md.

Step 4 - Sender nondeterministic retry: When reordering Ma, Mb, Md and
possibly other recently arrived non-orphan messages still fails to
bypass the software fault, p2 suspects some of these messages should not
have been generated in the first place. Therefore, p1 and p3 are
requested to roll back further by discarding the state interval indices
of the message logs that can deterministically generate these messages.
The resulting recovery line is given in Fig. 4(d). Nondeterminism can be
introduced by p1 reordering Mv and Mw, and p3 reordering Mx, My and Mz.

Figure 4: Progressive retry (a) Step 1: receiver deterministic retry
(b) Step 2: receiver nondeterministic retry (c) Step 3: sender
deterministic retry (d) Step 4: sender nondeterministic retry.

6 For some applications, lost messages may be acceptable. For example,
if the lost message is a channel request message in a telephone
switching application, the user will simply redial or try again later.

7 p2 can notify p3 to roll back by sending p3 the largest sequence
number of any message sent from p3 and received by p2 before checkpoint
C. Similar messages are sent to all other processes.

Step 5 - Large-scope rollback retry: When all previous small-scope
retries fail, a large-scope rollback can be initiated. Instead of
backing off a few state intervals for reordering a small number of
messages involving a small number of processes, all processes in the
system are requested to roll back K intervals, where K should be a large
number compared to the distances involved in Step 1 through Step 4. The
recovery line computed from the remaining available logical checkpoints
is then used for the final-step retry.

The choice of K is a trade-off between output commit and garbage
collection versus the available nondeterminism. Outputs to the outside
world that cannot be rolled back should only be released after the
recovery line has advanced beyond the state intervals that generate
these outputs. Checkpoints and message logs can only be
garbage-collected after the restart line has passed their corresponding
state intervals [12]. Therefore, while a larger K means more
nondeterminism is available, it also results in slower output commit and
less effective garbage collection, which are translated into slower
response to the users and larger space overhead, respectively. In the
extreme case where fast output commit is the most important requirement
for the system, only those state intervals beyond the last output can be
backed off for introducing nondeterminism [10].
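
Taken together, the five steps form an escalation loop: each step is
attempted only after every smaller-scope step has failed. The driver
below is a minimal sketch of that control flow only. The callables stand
in for the actual rollback and reexecution of each step, and all names
are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, Callable[[], bool]]  # (label, attempt returning success)


def progressive_retry(steps: List[Step]) -> Optional[str]:
    """Try each step in order of increasing rollback scope and return the
    label of the first one that succeeds, or None if all of them fail."""
    for label, attempt in steps:
        if attempt():
            return label
    return None


attempted: List[int] = []


def make_attempt(number: int, succeeds: bool) -> Callable[[], bool]:
    """Build a stand-in attempt that records that it ran."""
    def attempt() -> bool:
        attempted.append(number)
        return succeeds
    return attempt


steps: List[Step] = [
    ("Step 1: receiver deterministic retry", make_attempt(1, False)),
    ("Step 2: receiver nondeterministic retry", make_attempt(2, False)),
    ("Step 3: sender deterministic retry", make_attempt(3, True)),
    ("Step 4: sender nondeterministic retry", make_attempt(4, True)),
    ("Step 5: large-scope rollback retry", make_attempt(5, True)),
]
outcome = progressive_retry(steps)
```

In this run Steps 4 and 5 are never attempted, which reflects the aim of
keeping the scope of rollback as small as the fault allows.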


5  Experience  and  Discussion 

In this section, we describe two examples from telecommunications
software with software boundary errors. By using the progressive retry
technique (Step 1 for Case 1 and Step 3 for Case 2), these programs were
able to quickly recover from the errors without service interruption. To
simplify the description, we have abstracted only the components which
contributed to the errors.

Case  1 

A telecommunication billing system consists of several processes using
shared memory for interprocess communication. There is one writer
process which updates several data structures in the shared memory and
the others are reader processes which read these data structures.
Because no semaphore or locking mechanism is used in accessing the
shared memory, there is a small probability that a reader may be
accessing the data structure while the writer is updating it (e.g.,
manipulating the pointers for inserting a new data node). In such a
case, the reader receives a segmentation violation fault and is then
recovered by a watcher process. Once the reader is restarted and tries
to read the same data structure again, the read operation succeeds
because the writer has finished the update.

This kind of error occurs once every few days. Whenever it happens, the
Step 1 retry mechanism can quickly correct the error. An alternative
(standard) way of dealing with this problem would be to use a locking
mechanism for accessing the shared memory. However, using coarse-grain
locks can result in unnecessary blocking of the reader processes, and
using fine-grain locks will incur large performance degradation and
introduce additional software complexity. Therefore, the billing system
has relied on the progressive retry mechanism as an alternative to a
locking mechanism to deal with the concurrency problem. The approach is
feasible in this case because mutual-exclusion conflicts occur only
rarely and are detected when they do occur.

Case  2 

In a cross-connection system, a Channel Control Monitor (CCM) process is
used to track the available channels in the switch. The CCM process
receives information from two other processes: a Channel Allocation (CA)
process which sends the channel allocation requests and a Channel
Deallocation (CD) process which sends the channel deallocation requests.
A boundary condition for CCM occurs when all channels are used and the
process receives additional allocation requests. In that case, a
clean-up procedure is called to free up some channels or to block
further requests. However, the clean-up procedure contained a software
fault which could cause the process to crash.

The cross-connection system uses a daemon watcher to detect a process
failure and employs a checkpointing and message logging mechanism to
recover from the failure. The following example illustrates how
progressive retry works in this system. Suppose the number of available
channels is 5. The command r2 stands for requesting two channels, and
the command f2 stands for freeing two channels. The following command
sequence could cause the CCM process to crash because of the boundary
error.

CA sends r2 r3 r1
CD sends f2 f3 f1
CCM receives r2 r3 r1 and crashes

If the message f2 is received and logged before CCM crashes, CCM will be
able to recover by reordering the message logs. However, if CCM crashes
before the f2 message is logged, reordering messages r2, r3 and r1
(Step 2) will not help. In this case, the local recovery of CCM fails
and CA and CD will be requested to resend their messages (Step 3).
Because of the nondeterminism in operating system scheduling and
communication delay, the messages may arrive at CCM in a different
order. For example, the message order can be

r2 r3 f2 f3 r1 f1

Since the boundary error does not occur in this case, the progressive
retry involving three processes succeeds.
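
The two message orders of Case 2 can be checked with a toy model of the
CCM process. This is a sketch only: `run_ccm` and `BoundaryError` are
hypothetical names, the capacity of 5 channels is taken from the example
above, and the faulty clean-up procedure is modeled simply as an
exception raised when a request exceeds the free channels.

```python
from typing import List

CAPACITY = 5  # number of channels in the example


class BoundaryError(Exception):
    """Stands in for the crash inside the faulty clean-up procedure."""


def run_ccm(commands: List[str], capacity: int = CAPACITY) -> int:
    """Process commands such as 'r2' (request two channels) and 'f2'
    (free two channels); return the number of channels still in use."""
    used = 0
    for cmd in commands:
        kind, count = cmd[0], int(cmd[1:])
        if kind == "r":
            if used + count > capacity:
                # Boundary condition: an allocation request arrives
                # when not enough channels are free.
                raise BoundaryError(f"{cmd} with {used} channels in use")
            used += count
        elif kind == "f":
            used -= count
        else:
            raise ValueError(f"unknown command {cmd!r}")
    return used


original_order = ["r2", "r3", "r1"]
retry_order = ["r2", "r3", "f2", "f3", "r1", "f1"]

try:
    run_ccm(original_order)
    crashed = False
except BoundaryError:
    crashed = True

channels_in_use = run_ccm(retry_order)
```

The original order reaches the boundary condition at r1, while the
order obtained after resending never has more than 5 channels in use.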

6  Concluding  Remarks 

We have described a method of applying the log-based recovery technique,
previously developed for fail-stop hardware failures, to recovery from
transient software errors. Our five-step progressive retry approach
discards partial message log information at each step in order to
introduce an increasing degree of nondeterminism for bypassing software
faults. Although not every software error can be recovered through
message resending, reordering and replaying, we have observed that
progressive retry can provide an effective and economic way of
recovering from boundary errors in long-life software systems. For a
specific telecommunication billing system, all the software errors
occurring for the past two years have been automatically corrected by
either Step 1 or Step 3 of the progressive retry mechanism. For a
replicated file system, we have observed that Step 1 and Step 2 were
able to recover from 90% of the software errors for the past six months.

The techniques described are being implemented in the fault tolerance
library libft which has been developed at AT&T Bell Laboratories [23].
Libft is a C library which supports N-version programming, recovery
blocks, exception handling, message logging, and checkpointing and
rollback recovery, and has been used by several AT&T products.
Currently, the recovery mechanism in libft provides the first two steps
of progressive retry, with the implementation of the remaining steps in
progress and the investigation of other useful reordering algorithms as
an important topic for future research.